129 research outputs found

Parking Assistant

Author: Mareček David
Publication venue: Vysoké učení technické v Brně. Fakulta informačních technologií
Publication date: 01/01/2012

This thesis deals with the design and implementation of a parking assistant. It introduces the types of sensors used for distance measurement and the possibilities of using a camera system. The implementation uses ultrasonic sensors, specifically SRF08 rangefinders, and web cameras. A user interface that combines data from the individual sensors was also designed and implemented. The parking assistant provides edge detection, audible and graphical distance signalling, and an optional automatic night mode.

Digital library of Brno University of Technology

National Repository of Grey Literature

Merged bilingual trees based on Universal Dependencies in Machine Translation

Author: Mareček David
Publication date: 01/01/2016

In this paper, we present our new experimental system for merging the dependency representations of two parallel sentences into one dependency tree. All inner nodes of the dependency tree represent source-target pairs of words; the extra words take the form of leaf nodes. We use the Universal Dependencies annotation style, in which function words, whose usage often differs between languages, are annotated as leaves. The parallel treebank is parsed in a minimally supervised way, and unaligned words are automatically pushed to the leaves. We present a simple translation system trained on such merged trees and evaluate it on the WMT 2016 English-to-Czech and Czech-to-English translation tasks. Even though the model is so far very simple and uses neither a language model nor a word-reordering model, the Czech-to-English variant reached a BLEU score similar to that of another established tree-based system.

Biblio at Institute of Formal and Applied Linguistics

Towards Parallel Czech-Russian Dependency Treebank

Author: Klyueva Natalia
Mareček David
Publication date: 30/11/2010

Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 44-52. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

DSpace at Tartu University Library

Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

Author: Balhar Jiří
Limisiewicz Tomasz
Mareček David
Publication date: 26/05/2023

Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that the overlap of vocabulary across languages can actually be detrimental to certain downstream tasks (POS, dependency tree labeling). In contrast, NER and sentence-level tasks (cross-lingual retrieval, NLI) benefit from sharing vocabulary. We also observe that the coverage of language-specific tokens in the multilingual vocabulary significantly impacts word-level tasks. Our study offers a deeper understanding of the role of tokenizers in multilingual language models and provides guidelines that help future model developers choose the most suitable tokenizer for their specific application before undertaking costly model pre-training. Comment: in ACL Findings 202

arXiv.org e-Print Archive

Measuring Memorization Effect in Word-Level Neural Networks Probing

Author: Mareček David
Musil Tomáš
Rosa Rudolf
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 29/06/2020

Multiple studies have probed representations emerging in neural networks trained for end-to-end NLP tasks and examined what word-level linguistic information may be encoded in the representations. In classical probing, a classifier is trained on the representations to extract the target linguistic information. However, there is a risk that the classifier simply memorizes the linguistic labels for individual words instead of extracting linguistic abstractions from the representations, and thus reports false positive results. While considerable effort has been made to minimize the memorization problem, the task of actually measuring the amount of memorization happening in the classifier has so far been understudied. In our work, we propose a simple general method for measuring the memorization effect, based on a symmetric selection of comparable sets of test words seen versus unseen in training. Our method can be used to explicitly quantify the amount of memorization happening in a probing setup, so that an adequate setup can be chosen and the results of the probing can be interpreted with a reliability estimate. We exemplify this by showcasing our method on a case study of probing for part of speech in a trained neural machine translation encoder. Comment: Accepted to TSD 2020. Will be published in Springer LNC

arXiv.org e-Print Archive

Crossref

Input Combination Strategies for Multi-Source Transformer Decoder

Author: Helcl Jindřich
Libovický Jindřich
Mareček David
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2018

In multi-source sequence-to-sequence tasks, the attention mechanism can be modeled in several ways. This topic has been thoroughly studied on recurrent architectures. In this paper, we extend the previous work to the encoder-decoder attention in the Transformer architecture. We propose four different input combination strategies for the encoder-decoder attention: serial, parallel, flat, and hierarchical. We evaluate our methods on the tasks of multimodal translation and translation with multiple source languages. The experiments show that the models are able to use multiple sources and improve over single-source baselines. Comment: Published at WMT1

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Automatic Alignment of Tectogrammatical Trees from Czech-English Parallel Corpus

Author: Mareček David
Publication venue: Univerzita Karlova, Matematicko-fyzikální fakulta
Publication date: 01/01/2011

Title: Automatic Alignment of Tectogrammatical Trees from Czech-English Parallel Corpus
Author: David Mareček
Department: Institute of Formal and Applied Linguistics
Supervisor: Ing. Zdeněk Žabokrtský, Ph.D.
Abstract: The goal of this thesis is to implement and evaluate a software tool for the automatic alignment of Czech and English tectogrammatical trees. The task is to find corresponding nodes between two trees that represent an English sentence and its Czech translation. A large number of aligned trees acquired from parallel corpora can be used for training transfer models for machine translation systems. It is also useful for linguists studying translation equivalents between the two languages. The thesis also describes the word alignment annotation process; manual word alignment was necessary for evaluating the aligner. The results of our experiments show that shifting the alignment task from the word layer to the tectogrammatical layer both (a) increases inter-annotator agreement on the task and (b) makes it possible to construct a feature-based algorithm that uses sentence structure and outperforms the GIZA++ aligner in terms of F-measure on aligned tectogrammatical node pairs. This is probably because the tectogrammatical representations of Czech and English sentences are much closer to each other than the surface sentences themselves.
Keywords: tectogrammatical layer, word alignment, machine translation
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics

CU Digital Repository